1 Learning Objectives - Explore

Exploratory Data Analysis (cont’d): - Pairs plot to show correlation between variables and avoid multicollinearity (see 8.2 Many predictors in a model) Logistic Regression seen as an evolution of techniques - Linear Model to show simplest multivariate regression, but predictions can be outside the binary values. - Generalized Linear Model uses a logit transformation to constrain the outputs to being within two values. - Generalized Additive Model allows for “wiggle” in predictor terms. - Maxent (Maximum Entropy) is a presence-only modeling technique that allows for a more complex set of shapes between predictor and response.

2 Load packages and environmental data layers

librarian::shelf(
  DT, dplyr, dismo, GGally, here, readr, tidyr)
select <- dplyr::select # overwrite raster::select
options(readr.show_col_types = F)

dir_data    <- here("data")
pts_env_csv <- file.path(dir_data, "pts_env.csv")

pts_env <- read_csv(pts_env_csv)
nrow(pts_env)
## [1] 1552

2.1 Displate environmental data in a table

datatable(pts_env, rownames = F)

2.2 Use GGally to look at pair plots and examine correlations between environmental variables

GGally::ggpairs(
  select(pts_env, -ID),
  aes(color = factor(present)))

# Logistic Regression ## Setup Data - drop rows of data with any NA values (later, we’ll learn how to impute values) - remove terms we don’t want to model

d <- pts_env %>% 
  select(-ID) %>%  # remove terms we don't want to model
  tidyr::drop_na() # drop the rows with NA values
nrow(d)
## [1] 1548

2.3 Linear Model

  • dependent variable Y: presence/absence, aka 1/0
  • independent variables X: everything else in the dataframe, aka the environmental layers
mod <- lm(present ~ ., data = d)
summary(mod)
## 
## Call:
## lm(formula = present ~ ., data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16297 -0.30808  0.06136  0.28021  1.09967 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -2.606e+00  9.115e-01  -2.859 0.004307 ** 
## WC_alt                    9.884e-05  1.015e-04   0.974 0.330322    
## WC_bio1                   1.975e-01  1.899e-02  10.397  < 2e-16 ***
## WC_bio4                   9.127e-04  2.103e-03   0.434 0.664422    
## WC_bio12                 -1.742e-03  2.291e-04  -7.606 4.89e-14 ***
## ER_climaticMoistureIndex  1.776e+00  2.233e-01   7.954 3.47e-15 ***
## ER_tri                    4.954e-03  1.340e-03   3.696 0.000227 ***
## ER_topoWet                3.068e-02  1.607e-02   1.909 0.056411 .  
## lon                       2.544e-02  2.550e-03   9.975  < 2e-16 ***
## lat                       1.128e-01  1.499e-02   7.519 9.31e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4017 on 1538 degrees of freedom
## Multiple R-squared:  0.3588, Adjusted R-squared:  0.3551 
## F-statistic: 95.63 on 9 and 1538 DF,  p-value: < 2.2e-16

Note: a linear model is ineffective because it predicts values outside the 0 - 1 range

y_predict <- predict(mod, pts_env, type="response")
y_true    <- pts_env$present

range(y_predict)
## [1] NA NA
range(y_true)
## [1] 0 1

3 Generalized Linear Model

To solve this problem, we will apply a Logit Transformation

# fit a generalized linear model with a binomial logit link function
mod <- glm(present ~ ., family = binomial(link="logit"), data = d)
summary(mod)
## 
## Call:
## glm(formula = present ~ ., family = binomial(link = "logit"), 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6979  -0.7890  -0.1476   0.7383   2.6407  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.677e+01  6.373e+00  -2.631 0.008516 ** 
## WC_alt                    1.781e-04  6.754e-04   0.264 0.792060    
## WC_bio1                   1.114e+00  1.318e-01   8.453  < 2e-16 ***
## WC_bio4                   2.136e-02  1.337e-02   1.597 0.110332    
## WC_bio12                 -1.051e-02  1.486e-03  -7.075 1.49e-12 ***
## ER_climaticMoistureIndex  1.150e+01  1.503e+00   7.653 1.97e-14 ***
## ER_tri                    3.434e-02  8.899e-03   3.859 0.000114 ***
## ER_topoWet                2.057e-01  1.008e-01   2.040 0.041324 *  
## lon                       1.253e-01  1.628e-02   7.696 1.40e-14 ***
## lat                       5.508e-01  1.118e-01   4.927 8.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2146.0  on 1547  degrees of freedom
## Residual deviance: 1493.1  on 1538  degrees of freedom
## AIC: 1513.1
## 
## Number of Fisher Scoring iterations: 5

Note: we are not within the 0 - 1 range we want to be in

y_predict <- predict(mod, d, type="response")
range(y_predict)
## [1] 0.009644674 0.973727980

3.1 Term Plots

termplot(mod, partial.resid = TRUE, se = TRUE, main = F)

4 Generalized Additive Model (GAM)

We can further improve the GLM by adding “wiggle” to the relationship between the predictor and response variables

librarian::shelf(mgcv)

# fit a generalized additive model with smooth predictors
mod <- mgcv::gam(
  formula = present ~ s(WC_alt) + 
                      s(WC_bio1) + 
                      s(WC_bio4) + 
                      s(WC_bio12) + 
                      s(ER_climaticMoistureIndex) + 
                      s(ER_tri) + 
                      s(ER_topoWet) + 
                      s(lon) + 
                      s(lat), 
  family = binomial, data = d)
summary(mod)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## present ~ s(WC_alt) + s(WC_bio1) + s(WC_bio4) + s(WC_bio12) + 
##     s(ER_climaticMoistureIndex) + s(ER_tri) + s(ER_topoWet) + 
##     s(lon) + s(lat)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -1.3946     0.5662  -2.463   0.0138 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df Chi.sq  p-value    
## s(WC_alt)                   3.780  4.476  8.753 0.058729 .  
## s(WC_bio1)                  5.456  6.579 36.937  < 2e-16 ***
## s(WC_bio4)                  5.523  6.748 21.327 0.003969 ** 
## s(WC_bio12)                 7.546  8.301 32.954 0.000126 ***
## s(ER_climaticMoistureIndex) 6.022  6.982 19.615 0.006694 ** 
## s(ER_tri)                   1.526  1.842 10.125 0.007122 ** 
## s(ER_topoWet)               4.315  5.375  9.900 0.081225 .  
## s(lon)                      8.766  8.944 59.976  < 2e-16 ***
## s(lat)                      6.803  7.803 26.436 0.000582 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.547   Deviance explained = 50.1%
## UBRE = -0.2428  Scale est. = 1         n = 1548

4.1 Term Plots

plot(mod, scale=0)